Abstract. The Swap-Insert Correction distance from a string $S$ of length
$n$ to another string $L$ of length $m \ge n$ on the alphabet $[1..d]$ is the
minimum number of insertions, and swaps of pairs of adjacent symbols,
converting $S$ into $L$. Contrary to other correction distances, computing
it is NP-hard in the size $d$ of the alphabet. We describe an algorithm
computing this distance in time within $O(d^2 n m g^{d-1})$, where there
are $n_\alpha$ occurrences of $\alpha$ in $S$, $m_\alpha$ occurrences of $\alpha$ in $L$, and where
$g = \max_{\alpha \in [1..d]} \min\{n_\alpha, m_\alpha - n_\alpha\}$ measures the difficulty of the instance.
The difficulty $g$ is bounded from above by various terms, such as the length
$n$ of the shorter string $S$, and the maximum number of occurrences of
a single character in $S$. These results illustrate how, in many cases, the
correction distance between two strings can be easier to compute than
in the worst-case scenario.
1 Introduction

Given two strings $S$ and $L$ on the alphabet $\Sigma = [1..d]$ and a list of correction
operations on strings, the String-to-String Correction distance is the minimum
number of operations required to transform the string $S$ into the string $L$.
Introduced in 1974 by Wagner and Fischer [7], this concept has many applications,
from suggesting corrections for typing mistakes, to decomposing the
changes between two consecutive versions into a minimum number of correction
steps, for example within a version control system such as cvs, svn or git.

Each distinct set of correction operators yields a distinct correction distance
on strings. For instance, Wagner and Fischer [7] showed that for the three
following operations, the insertion of a symbol at some arbitrary position, the
deletion of a symbol at some arbitrary position, and the substitution of
a symbol at some arbitrary position, there is a dynamic program solving this
problem in time within $O(nm)$ when $S$ is of length $n$ and $L$ of length $m$.
Similar complexity results, all polynomial, hold for many other subsets of
the natural correction operators, with one striking exception: Wagner [B] proved
the NP-hardness of the Swap-Insert Correction distance, denoted $\delta(S, L)$
throughout this paper, i.e. the correction distance when restricted to the operators
insertion and swap (or, by symmetry, to the operators deletion and swap).

The difficulty of the Swap-Insert Correction distance attracted special interest,
with two results of importance: Abu-Khzam et al. [1] described an algorithm
computing $\delta(S, L)$ in time within $O(1.6181^{\delta(S,L)} m)$, and Meister [4] described
an algorithm computing $\delta(S, L)$ in time polynomial in the input size when $S$
and $L$ are strings on a finite alphabet: its running time is $(m+1)^{2d+1} \cdot (n+1)^2$
times some polynomial function on $n$ and $m$.

The complexity of Meister's result [4], polynomial in $m$ of degree $2d + 1$, is a
very pessimistic approximation of the computational complexity of the distance.
At one extreme, the Swap-Insert Correction distance between two strings
which are very similar (e.g. only a finite number of symbols need to be swapped
or inserted) can be computed in time linear in $n$ and $d$. At the other extreme, the
Swap-Insert Correction distance of strings which are completely different
(e.g. their effective alphabets are disjoint) can also be computed in linear time
(it is then close to $n + m$). Even when $S$ and $L$ are quite different, $\delta(S, L)$ can
be "easy" to compute: when mostly swaps are involved to transform $S$ into $L$
(i.e. $S$ and $L$ are almost of the same length), and when mostly insertions are
involved to transform $S$ into $L$ (i.e. many symbols present in $L$ are absent from
$S$).

Hypothesis: We consider whether the Swap-Insert Correction distance
$\delta(S, L)$ can be computed in time polynomial in the length of the input strings
for a constant alphabet size, while still taking advantage of cases such as those
described above, where the distance $\delta(S, L)$ can be computed much faster.

Our Results: After a short review of previous results and techniques in
Section 2, we present such an algorithm in Section 3 in four steps: the intuition
behind the algorithm in Section 3.1, the formal description of the dynamic
program in Section 3.2, the complete algorithm in Section 3.3, and the formal
analysis of its complexity in Section 3.4. In the latter, we define the local
imbalance $g_\alpha = \min\{n_\alpha, m_\alpha - n_\alpha\}$ for each symbol $\alpha \in \Sigma$, summarized by
the global imbalance measure $g = \max_{\alpha \in \Sigma} g_\alpha$, and prove that our algorithm
runs in time within

$$O\Big( d(n+m) + d^2 n \cdot \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) \Big),$$

in the worst case over all instances of fixed sizes $n$ and $m$, with imbalance vector
$(g_1, \ldots, g_d)$, where $\Sigma_+ = \{\alpha \in \Sigma : g_\alpha > 0\}$ if $g_\alpha = 0$ for some $\alpha \in \Sigma$, and
$\Sigma_+ = \Sigma \setminus \{\arg\min_{\alpha \in \Sigma} g_\alpha\}$ otherwise. This simplifies to within $O(d^2 g^{d-1} n m)$ in
the worst case over instances where $d$, $n$, $m$ and $g$ are fixed.
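
As an illustration of these definitions, the following Python sketch computes the imbalance vector, the global imbalance $g$, the set $\Sigma_+$, and the value of the expression inside the $O(\cdot)$ above. The function names `imbalance_profile` and `bound_main_term` are hypothetical, and taking $d$ to be the number of distinct symbols occurring in $S$ or $L$ is an assumption of the sketch.

```python
from collections import Counter
from math import prod


def imbalance_profile(S, L):
    """Return (g_vec, g, sigma_plus) following the definitions in the text."""
    n_cnt, m_cnt = Counter(S), Counter(L)
    alphabet = sorted(set(S) | set(L))  # assumption: d = number of distinct symbols
    g_vec = {a: min(n_cnt[a], m_cnt[a] - n_cnt[a]) for a in alphabet}
    if any(v < 0 for v in g_vec.values()):
        raise ValueError("some symbol occurs more often in S than in L")
    g = max(g_vec.values(), default=0)
    if not alphabet or any(v == 0 for v in g_vec.values()):
        # Some g_a is zero: Sigma_+ keeps the symbols with positive imbalance.
        sigma_plus = [a for a in alphabet if g_vec[a] > 0]
    else:
        # All g_a are positive: drop one symbol of minimum imbalance.
        smallest = min(alphabet, key=lambda a: g_vec[a])
        sigma_plus = [a for a in alphabet if a != smallest]
    return g_vec, g, sigma_plus


def bound_main_term(S, L):
    """Evaluate d(n+m) + d^2 n * sum(m_a - g_a) * prod_{a in Sigma_+}(g_a + 1)."""
    g_vec, _, sigma_plus = imbalance_profile(S, L)
    m_cnt = Counter(L)
    d, n, m = len(g_vec), len(S), len(L)
    total = sum(m_cnt[a] - g_vec[a] for a in g_vec)
    product = prod(g_vec[a] + 1 for a in sigma_plus)
    return d * (n + m) + d * d * n * total * product
```

For instance, with S = "abab" and L = "aabbab" every symbol has imbalance 1, so $g = 1$ and $\Sigma_+$ contains a single symbol.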

We discuss in Section 4 some implied results, and some questions left open,
such as when the operators are assigned asymmetric costs, when the algorithm
is required to output the sequence of corrections, when only swaps are allowed,
or when the distribution of the frequencies of the symbols is very unbalanced.


2 Background 


In 1974, motivated by the problem of correcting typing and transmission errors,
Wagner and Fischer [7] introduced the String-to-String Correction problem,
which is to compute the minimum number of corrections required to change
the source string $S$ into the target string $L$. They considered the following
operators: the insertion of a symbol at some arbitrary position, the deletion of a
symbol at some arbitrary position, and the substitution of a symbol at some
arbitrary position. They described a dynamic program solving this problem in
time within $O(nm)$ when $S$ is of length $n$ and $L$ of length $m$. The worst case
among instances of fixed input size $n + m$ is when n = m/2, which yields a
complexity within $O(n^2)$.

In 1975, Lowrance and Wagner [S] extended the String-to-String Correction
distance to the cases where one considers not only the insertion,
deletion, and substitution operators, but also the swap operator, which
exchanges the positions of two contiguous symbols. Not counting the identity,
fifteen different variants arise when considering any given subset of those four
correction operators. Thirteen of those variants can be computed in polynomial
time [Slzll]. The two remaining distances, the Swap-Insert Correction distance
and its symmetric, the Swap-Delete Correction distance, are equivalent by
symmetry, and are NP-hard to compute [5], hence our
interest. All our results on the computation of the Swap-Insert Correction
distance from $S$ to $L$ directly imply the same results on the computation of the
Swap-Delete Correction distance from $L$ to $S$.

In 2011, Abu-Khzam et al. [1] described an algorithm computing the Swap-Delete
Correction distance from a string $L$ to a string $S$ (and hence the
Swap-Insert Correction distance from $S$ to $L$). Their algorithm decides if this
distance is at most a given parameter $k$, in time within $O(1.6181^k m)$. This indirectly
yields an algorithm computing both distances in time within $O(1.6181^{\delta(S,L)} m)$:
testing values of $k$ from 0 to infinity in increasing order yields an algorithm
computing the distance in time within $\sum_{k=0}^{\delta(S,L)} O(1.6181^k m) \subseteq O(1.6181^{\delta(S,L)} m)$.

Since any correct algorithm must verify the correctness of its output, such an 
algorithm implies the existence of an algorithm with the same running time 
which outputs a minimum sequence of corrections from S to L. Later in 2013, 
Watt [5] showed that computing the Swap-Deletion Correction distance 
has a kernel size of 0{k*). 

In 2013, Spreen observed that Wagner’s NP-hardness proof [B] was based 
on unbounded alphabet sizes (i.e. the Swap-Insert Correction problem is 
NP-hard when the size d of the alphabet is part of the input), and suggested 
that this problem might be tractable for fixed alphabet sizes. He described some 
polynomial-time algorithms for various special cases when the alphabet is binary, 
and some more general properties. 

In 2014, Meister [4] extended Spreen's work to an algorithm computing
the Swap-Insert Correction distance from a string $S$ of length $n$ to another
string $L$ of length $m$ on any fixed alphabet size $d \ge 2$, in time polynomial
in $n$ and $m$. The algorithm is explicitly based on finding an injective function
$\varphi : [1..n] \to [1..m]$ such that $\varphi(i) = j$ only if $S[i] = L[j]$, and the total
number of crossings is minimized. Two positions $i < i'$ of $S$ define a crossing if
and only if $\varphi(i) > \varphi(i')$. Such a number of crossings equals the number of swaps,
and the number of insertions is always equal to $m - n$. Meister proved that the
time complexity of this algorithm is equal to $(m+1)^{2d+1} \cdot (n+1)^2$ times some
function polynomial in $n$ and $m$.

We describe in the following section an algorithm computing the Swap-Insert
Correction distance in explicit polynomial time, and whose running
time goes gradually down to linear for easier cases.

3 Algorithm 

We describe the intuition behind our algorithm in Section 3.1, the high level
description of the dynamic program in Section 3.2, the full code of the algorithm
in Section 3.3, and the formal analysis of its complexity in Section 3.4.


3.1 High level description 

The algorithm runs through $S$ and $L$ from left to right, building a mapping
from the characters of $S$ to a subset of the characters of $L$, using the fact that,
for each distinct character, the mapping function on positions is monotone. The
Dynamic Programming matrix has size $n_1 \times \cdots \times n_d < n^d$.

For every string $X \in \{S, L\}$, let $X[i]$ denote the $i$-th symbol of $X$ from left to
right ($i \in [1..|X|]$), and $X[i..j]$ denote the substring of $X$ from the $i$-th symbol to
the $j$-th symbol ($1 \le i \le j \le |X|$). For every $1 \le j < i \le n$, let $X[i..j]$ denote the
empty string. Given any symbol $\alpha \in \Sigma$, let $rank(X, i, \alpha)$ denote the number of
occurrences of the symbol $\alpha$ in the substring $X[1..i]$, and $select(X, k, \alpha)$ denote
the value $j \in [1..|X|]$ such that the $k$-th occurrence of $\alpha$ in $X$ is precisely at
position $j$, if $j$ exists. If $j$ does not exist, then $select(X, k, \alpha)$ is null.
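
A minimal Python sketch of these two queries follows, assuming a simple table of prefix counts and per-symbol position lists (one possible realization, not necessarily the structure used by the authors). The helper name `build_rank_select` is hypothetical; it returns closures `rank(i, a)` and `select(k, a)` for a fixed string, using the 1-based positions of the text and `None` for null.

```python
def build_rank_select(X, alphabet):
    """Return (rank, select) closures over X, with 1-based positions."""
    index = {a: k for k, a in enumerate(alphabet)}
    # prefix[i][k] = number of occurrences of alphabet[k] in X[1..i]
    prefix = [[0] * len(alphabet)]
    # positions[a] = increasing list of the 1-based positions of a in X
    positions = {a: [] for a in alphabet}
    for p, ch in enumerate(X, start=1):
        row = prefix[-1].copy()
        row[index[ch]] += 1
        prefix.append(row)
        positions[ch].append(p)

    def rank(i, a):
        # Occurrences of a in X[1..i]; rank(0, a) is 0.
        return prefix[i][index[a]]

    def select(k, a):
        # Position of the k-th occurrence of a, or None (null) if there is none.
        occ = positions[a]
        return occ[k - 1] if 1 <= k <= len(occ) else None

    return rank, select
```

Both tables take space and construction time proportional to $d \cdot |X|$, and each query is answered in constant time.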

The algorithm runs through S and L simultaneously from left to right, skip¬ 
ping positions where the current symbol of S equals the current symbol of L, and 
otherwise branching out between two options to correct the current symbol of 
S: inserting a symbol equal to the current symbol of L in the current position of 
S, or moving (by applying many swaps) the first symbol of the part not scanned 
of S equal to the current symbol of L, to the current position in S. 

More formally, the computation of $\delta(S, L)$ can be reduced to the application
of four rules:

— if $S$ is empty: We just return the length $|L|$ of $L$, since insertions are the
only possible operations to perform in $S$.

— if some $\alpha \in \Sigma$ appears more times in $S$ than in $L$: We return $+\infty$,
since delete operations are not allowed to make $S$ and $L$ match.

— if $S$ and $L$ are not empty, $S[1] = L[1]$: We return $\delta(S[2..|S|], L[2..|L|])$.

— if $S$ and $L$ are not empty, $S[1] \ne L[1]$: We compute two distances: the
distance $d_{ins} = 1 + \delta(S, L[2..|L|])$ corresponding to an insertion of the
symbol $L[1]$ at the first position of $S$, and the distance $d_{swaps} = (r - 1) +
\delta(S', L[2..|L|])$ corresponding to performing $r - 1$ swaps to bring to the first
position of $S$ the first symbol of $S$ equal to $L[1]$. In this case, $r$ denotes the
position of such a symbol, and $S'$ the string resulting from $S$ by removing
that symbol. We then return $\min\{d_{ins}, d_{swaps}\}$.
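
The four rules above can be transcribed almost literally into Python. The sketch below works directly on strings and memoizes on the pair of strings, so it is only meant for small inputs; the function name `delta` is chosen here for illustration, and an unreachable target is reported as `math.inf`.

```python
from collections import Counter
from functools import lru_cache
from math import inf


@lru_cache(maxsize=None)
def delta(S, L):
    """Swap-Insert Correction distance from S to L, by the four rules."""
    # Rule 1: only insertions remain.
    if not S:
        return len(L)
    # Rule 2: Counter subtraction keeps only positive counts, so the result
    # is non-empty exactly when some symbol is more frequent in S than in L.
    if Counter(S) - Counter(L):
        return inf
    # Rule 3: the first symbols match (L is non-empty here, by rule 2).
    if S[0] == L[0]:
        return delta(S[1:], L[1:])
    # Rule 4: either insert L[1] in front of S ...
    d_ins = 1 + delta(S, L[1:])
    # ... or bring the first occurrence of L[1] in S to the front.
    d_swaps = inf
    r = S.find(L[0])  # 0-based index, i.e. the number of swaps needed
    if r != -1:
        d_swaps = r + delta(S[:r] + S[r + 1:], L[1:])
    return min(d_ins, d_swaps)
```

For example, `delta("ab", "ba")` evaluates to 1 (a single swap), and `delta("a", "b")` evaluates to `inf` because deletions are not available.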

There can be several overlapping subproblems in the recursive definition of
$\delta(S, L)$ described above, which calls for dynamic programming [3] and
memoization¹. In any call $\delta(S', L')$ in the recursive computation of $\delta(S, L)$, the string
$L'$ is always a substring $L[j..|L|]$ for some $j \in [1..|L|]$, and can thus be replaced
by such an index $j$, but this is not always the case for the string $S'$. Observe
that $S'$ is a substring $S[i..|S|]$ for some $i \in [1..|S|]$ with (possibly) some symbols
removed. Furthermore, if for some symbol $\alpha \in \Sigma$ precisely $c_\alpha$ symbols $\alpha$ of
$S[i..|S|]$ have been removed, then those symbols are precisely the first $c_\alpha$
symbols $\alpha$ from left to right. We can then represent $S'$ by the index $i$ and a counter
$c_\alpha$, for each symbol $\alpha \in \Sigma$, of how many symbols $\alpha$ of $S[i..|S|]$ are removed (i.e.
ignored). In the above fourth rule, the position $r$ is equivalent to the position of
the $(c_{L[1]} + 1)$-th occurrence of the symbol $L[1]$ in $S[i..|S|]$. To quickly compute
$r$, the functions rank and select will be used.

Let $W = \prod_{\alpha=1}^{d} [0..n_\alpha]$ denote the domain of such vectors of counters, where
for any $c = (c_1, c_2, \ldots, c_d) \in W$, $c_\alpha$ denotes the counter for $\alpha \in \Sigma$. Using
the ideas described above, the algorithm recursively computes the extension
$DIST(i, j, c)$ of $\delta(S, L)$, defined for each $i \in [1..n+1]$, $j \in [1..m+1]$, and
$c = (c_1, c_2, \ldots, c_d) \in W$, as the value of $\delta(S[i..n]_c, L[j..m])$, where $S[i..n]_c$ is the
string obtained from $S[i..n]$ by removing (i.e. ignoring) for each $\alpha \in \Sigma$ the first
$c_\alpha$ occurrences of $\alpha$ from left to right.

Given this definition, $\delta(S, L) = DIST(1, 1, 0)$, where $0$ denotes the vector
$(0, \ldots, 0) \in W$. Given $i$, $j$, and $c$, $DIST(i, j, c) < +\infty$ if and only if for each
symbol $\alpha \in \Sigma$ the number of considered (i.e. not removed or ignored) $\alpha$ symbols
in $S[i..n]$ is at most the number of $\alpha$ symbols in $L[j..m]$. That is, $count(S, i, \alpha) -
c_\alpha \le count(L, j, \alpha)$ for all $\alpha \in \Sigma$, where $count(X, i, \alpha) = rank(X, |X|, \alpha) -
rank(X, i - 1, \alpha)$ is the number of symbols $\alpha$ in the string $X[i..|X|]$. In the
following, we show how to compute $DIST(i, j, c)$ recursively for every $i$, $j$, and
$c$. For a given $\alpha \in \Sigma$, let $w_\alpha \in W$ be the vector whose components are all equal
to zero except the $\alpha$-th component that is equal to 1.
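
To make the notation $S[i..n]_c$ and the finiteness criterion concrete, here is a small Python sketch. The helper names `restricted_suffix` and `is_feasible` are hypothetical, counters are passed as a dictionary from symbols to integers, and positions are 1-based as in the text. The sketch assumes that each counter $c_\alpha$ does not exceed the number of $\alpha$ symbols in $S[i..n]$, in which case comparing the counts of $S[i..n]_c$ with those of $L[j..m]$ is the same test as $count(S, i, \alpha) - c_\alpha \le count(L, j, \alpha)$.

```python
from collections import Counter


def restricted_suffix(S, i, c):
    """Build S[i..n]_c: drop the first c[a] occurrences of each symbol a in S[i..n]."""
    remaining = dict(c)
    kept = []
    for ch in S[i - 1:]:
        if remaining.get(ch, 0) > 0:
            remaining[ch] -= 1  # this occurrence is ignored
        else:
            kept.append(ch)
    return "".join(kept)


def is_feasible(S, L, i, j, c):
    """True when no symbol is more frequent in S[i..n]_c than in L[j..m]."""
    left = Counter(restricted_suffix(S, i, c))
    right = Counter(L[j - 1:])
    return all(left[a] <= right[a] for a in left)
```

For example, with S = "abcab", i = 2 and one ignored b, the suffix "bcab" becomes "cab".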

3.2 Recursive computation of DIST{i, j,c) 

We will use the following observation which considers the swap operations per¬ 
formed in the optimal transformation from a short string S of length n to a 
larger string L of length m. 

Observation 1 ([1]). The swap operations used in any optimal solution satisfy
the following properties: two equal symbols cannot be swapped; each symbol is
always swapped in the same direction in the string; and if some symbol is moved
from some position to another by performing swap operations, then no symbol
equal to it can be inserted afterwards between these two positions.

¹ Cormen et al. [3] explain that memoization comes from memo, referring to the fact
that the technique consists in recording a value so that we can look it up later.

The following lemma deals with the basic case where $S[i..n]$ and $L[j..m]$ start
with the same symbol, i.e. $S[i] = L[j]$. When the beginnings of both strings are
the same, matching those two symbols seems like an obvious choice in order to
minimize the distance, but one must be careful to check first if the first symbol
from $S[i..n]$ has not been scheduled to be "swapped" to an earlier position, in
which case it must be ignored and skipped:

Lemma 1. Given two strings $S$ and $L$ over the alphabet $\Sigma$, for any positions
$i \in [1..n]$ in $S$ and $j \in [1..m]$ in $L$, for any vector of counters $c = (c_1, \ldots, c_d) \in W$
and for any symbol $\alpha \in \Sigma$,

$$S[i] = L[j] = \alpha \text{ and } c_\alpha = 0 \;\Longrightarrow\; DIST(i, j, c) = DIST(i+1, j+1, c).$$

Proof. Given strings $X$, $Y$ in the alphabet $\Sigma$, and an integer $k$, Abu-Khzam et
al. [1, Corollary 1] proved that if $X[1] = Y[1]$, then:

$$\delta(X, Y) \le k \text{ if and only if } \delta(X[2..|X|], Y[2..|Y|]) \le k.$$

Given that one option to transform $X$ into $Y$ with the minimum number of
operations is to transform $X[2..|X|]$ into $Y[2..|Y|]$ with the minimum number of
operations (matching $X[1]$ with $Y[1]$), we have:

$$\delta(X, Y) \le \delta(X[2..|X|], Y[2..|Y|]).$$

By selecting $k = \delta(X, Y)$, we obtain the equality

$$\delta(X, Y) = \delta(X[2..|X|], Y[2..|Y|]).$$

Then, since the symbol $\alpha = S[i]$ must be considered (because $c_\alpha = 0$), and
$S[i] = L[j]$, we can apply the above statement for $X = S[i..n]_c$ and $Y = L[j..m]$
to obtain the next equalities:

$$DIST(i, j, c) = \delta(X, Y) = \delta(X[2..|X|], Y[2..|Y|]) = DIST(i+1, j+1, c).$$

The result thus follows. □

The second simplest case is when the first available symbol of $S[i..n]$ is already
matched (through swaps) to a symbol from $L[1..j-1]$. The following lemma
shows how to simply skip such a symbol:

Lemma 2. Given $S$ and $L$ over the alphabet $\Sigma$, for any positions $i \in [1..n]$ in
$S$ and $j \in [1..m]$ in $L$, and for any vector of counters $c = (c_1, \ldots, c_d) \in W$ and
for any symbol $\alpha \in \Sigma$,

$$S[i] = \alpha \text{ and } c_\alpha > 0 \;\Longrightarrow\; DIST(i, j, c) = DIST(i+1, j, c - w_\alpha).$$

Proof. Since $c_\alpha > 0$, the first $c_\alpha$ symbols $\alpha$ of $S[i..n]$ have been ignored, thus
$S[i]$ is ignored. Then, $DIST(i, j, c)$ must be equal to $DIST(i+1, j, c - w_\alpha)$, case
in which $c_\alpha - 1$ symbols $\alpha$ of $S[i+1..n]$ are ignored. □

The most important case is when the first symbols of $S[i..n]$ and $L[j..m]$
do not match: the minimum "path" from $S$ to $L$ can then start either by an
insertion or a swap operation.

Lemma 3. Given $S$ and $L$ over the alphabet $\Sigma$, for any positions $i \in [1..n]$
in $S$ and $j \in [1..m]$ in $L$, and for any vector of counters $c = (c_1, \ldots, c_d) \in W$,
note $\alpha, \beta \in \Sigma$ the symbols $\alpha = S[i]$ and $\beta = L[j]$, $r$ the position
$r = select(S, rank(S, i, \beta) + c_\beta + 1, \beta)$ in $S$ of the $(c_\beta + 1)$-th symbol $\beta$ of $S[i..n]$, and
$\Delta$ the number $\sum_{\theta=1}^{d} \min\{c_\theta, rank(S, r, \theta) - rank(S, i-1, \theta)\}$ of symbols ignored
in $S[i..r]$.

If $\alpha \ne \beta$ and $c_\alpha = 0$, then $DIST(i, j, c) = \min\{d_{ins}, d_{swaps}\}$, where

$$d_{ins} = \begin{cases} DIST(i, j+1, c) + 1 & \text{if } c_\beta = 0 \\ +\infty & \text{if } c_\beta > 0 \end{cases}$$

and

$$d_{swaps} = \begin{cases} (r - i) - \Delta + DIST(i, j+1, c + w_\beta) & \text{if } r \ne \text{null} \\ +\infty & \text{if } r = \text{null}. \end{cases}$$

Proof. Let $S'[1..n'] = S[i..n]_c$. Given that $\alpha \ne \beta$ and $c_\alpha = 0$, there are two
possibilities for $DIST(i, j, c)$: (1) transform $S'[1..n']$ into $L[j+1..m]$ with the
minimum number of operations, and after that insert a symbol $\beta$ at the first
position of the resulting $S'[1..n']$; or (2) move the first symbol $\beta$ (from left to
right) of $S'[2..n']$ from its current position $r'$ to the position 1 by performing $r' - 1$ swaps,
and then transform the resulting $S'[2..n']$ into $L[j+1..m]$ with the minimum
number of operations. Observe that option (1) can be performed if and only if
there is no symbol $\beta$ ignored in $S[i..n]$ (see Observation 1). If this is the case,
then $DIST(i, j, c) = DIST(i, j+1, c) + 1$. Option (2) can be used if and only if
there is a non-ignored symbol $\beta$ in $S[i..n]$, where the first one from left to right
is precisely at position $r = select(S, rank(S, i, \beta) + c_\beta + 1, \beta)$. In such a case
$r' = (r - i + 1) - \Delta$, where $\Delta = \sum_{\theta=1}^{d} \min\{c_\theta, rank(S, r, \theta) - rank(S, i-1, \theta)\}$
is the total number of ignored symbols in the string $S[i..r]$. Hence, the number
of swaps amounts to $r' - 1 = (r - i) - \Delta$. Then, the correctness of $d_{ins}$, $d_{swaps}$,
and the result follow. □
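
As a small worked instance of the quantities $r$ and $\Delta$ of Lemma 3 (the values below are chosen purely for illustration and do not come from the original text), take $S = acb$, $i = 1$, $\beta = b$, and suppose the single $c$ symbol is already ignored:

```latex
% Illustrative values: S = acb, i = 1, alpha = S[1] = a, beta = b,
% counters (c_a, c_b, c_c) = (0, 0, 1), so S[1..3]_c = ab.
\begin{align*}
r &= \mathit{select}(S, \mathit{rank}(S, 1, b) + c_b + 1, b)
   = \mathit{select}(S, 1, b) = 3, \\
\Delta &= \min\{c_a, 1\} + \min\{c_b, 1\} + \min\{c_c, 1\}
   = 0 + 0 + 1 = 1, \\
r' - 1 &= (r - i) - \Delta = (3 - 1) - 1 = 1 .
\end{align*}
```

This agrees with a direct inspection: in $S[1..3]_c = ab$ the symbol $b$ sits at position $r' = 2$, so a single swap brings it to the front.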


The next two lemmas deal with the cases where one string is completely 
processed. When L has been completely processed, either the remaining symbols 
in S have all previously been matched via swaps and the distance equals zero, 
or there is no sequence of operations correcting S into L: 


Lemma 4. Given $S$ and $L$ over the alphabet $\Sigma$, for any positions $i \in [1..n+1]$
in $S$ and $j \in [1..m]$ in $L$, for any vector of counters $c = (c_1, \ldots, c_d) \in W$,

$$DIST(i, m+1, c) = \begin{cases} 0 & \text{if } c_1 + \cdots + c_d = n - i + 1, \text{ and} \\ +\infty & \text{otherwise.} \end{cases}$$

Proof. Note that $DIST(i, m+1, c)$ is the minimum number of operations to
transform the string $S[i..n]_c$ into the empty string $L[m+1..m]$. This number is
zero if and only if all the $n - i + 1$ symbols of $S[i..n]$ have been ignored, that is,
$c_1 + \cdots + c_d = n - i + 1$. If not all the symbols have been ignored, then such a
transformation does not exist and $DIST(i, m+1, c) = +\infty$. □

When S has been completely processed, there are only insertions left to per¬ 
form: the distance can be computed in constant time, and the list of corrections 
in linear time. 


Lemma 5. Given $S$ and $L$ over the alphabet $\Sigma$, for any position $j \in [1..m+1]$
in $L$, and for any vector of counters $c = (c_1, \ldots, c_d) \in W$,

$$DIST(n+1, j, c) = \begin{cases} m - j + 1 & \text{if } c = 0, \text{ and} \\ +\infty & \text{otherwise.} \end{cases}$$

Proof. Note that $DIST(n+1, j, c)$ is the minimum number of operations to
transform the empty string $S[n+1..n]$ into the string $L[j..m]$. If $c = 0$, then
$DIST(n+1, j, c) < +\infty$ and the transformation consists of exactly $m - j + 1$
insertions. If $c \ne 0$, then $DIST(n+1, j, c) = +\infty$. □


3.3 Complete algorithm 

In the following, we describe the formal algorithm to compute $DIST(i, j, 0)$. We
consider the worst scenario for the running time of Theorem 1, where for each
symbol $\alpha \in \Sigma$ we have $g_\alpha > 0$. The other cases, in which $g_\alpha > 0$ is not satisfied for
all $\alpha \in \Sigma$, are easier to implement. Note that line 3 of algorithm Compute and
line 26 of algorithm DIST2 guarantee that $DIST2(i, j) = DIST(i, j, c) < +\infty$
in every call of DIST2. Further, the counters $(c_1, c_2, \ldots, c_d)$ are global variables
to the recursive DIST2.
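
For readers who prefer executable code, the recursion given by Lemmas 1 to 5 can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the authors' DIST2: it memoizes directly on the triple $(i, j, c)$ with `functools.lru_cache` instead of the compact table $T$, it passes the counter vector as a tuple rather than through global variables, and it takes the alphabet to be the set of symbols occurring in $S$ or $L$. The name `swap_insert_distance` and its internal helpers are hypothetical.

```python
from functools import lru_cache
from math import inf


def swap_insert_distance(S, L):
    n, m = len(S), len(L)
    alphabet = sorted(set(S) | set(L))
    index = {a: k for k, a in enumerate(alphabet)}
    d = len(alphabet)

    def prefix_counts(X):
        # table[i][a] = rank(X, i, a), with table[0] all zeros
        table = [[0] * d]
        for ch in X:
            row = table[-1].copy()
            row[index[ch]] += 1
            table.append(row)
        return table

    rank_S, rank_L = prefix_counts(S), prefix_counts(L)
    positions = [[] for _ in range(d)]
    for p, ch in enumerate(S, start=1):
        positions[index[ch]].append(p)

    def select_S(k, a):
        occ = positions[a]
        return occ[k - 1] if 1 <= k <= len(occ) else None

    def feasible(i, j, c):
        # count(S, i, a) - c_a <= count(L, j, a) for every symbol a
        return all(
            rank_S[n][a] - rank_S[i - 1][a] - c[a]
            <= rank_L[m][a] - rank_L[j - 1][a]
            for a in range(d)
        )

    def bump(c, a, step):
        # c + step * w_a, as a new tuple
        return c[:a] + (c[a] + step,) + c[a + 1:]

    @lru_cache(maxsize=None)
    def dist(i, j, c):
        if not feasible(i, j, c):
            return inf
        if i == n + 1:                      # Lemma 5
            return m - j + 1 if not any(c) else inf
        if j == m + 1:                      # Lemma 4
            return 0 if sum(c) == n - i + 1 else inf
        a, b = index[S[i - 1]], index[L[j - 1]]
        if c[a] > 0:                        # Lemma 2: S[i] was ignored
            return dist(i + 1, j, bump(c, a, -1))
        if a == b:                          # Lemma 1: S[i] and L[j] match
            return dist(i + 1, j + 1, c)
        d_ins = d_swaps = inf               # Lemma 3
        if c[b] == 0:
            d_ins = 1 + dist(i, j + 1, c)
        r = select_S(rank_S[i][b] + c[b] + 1, b)
        if r is not None:
            ignored = sum(
                min(c[t], rank_S[r][t] - rank_S[i - 1][t]) for t in range(d)
            )
            d_swaps = (r - i) - ignored + dist(i, j + 1, bump(c, b, +1))
        return min(d_ins, d_swaps)

    return dist(1, 1, (0,) * d)
```

Because the sketch recurses once per processed symbol, very long inputs may exceed Python's default recursion limit; an iterative evaluation order would avoid this.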


3.4 Complexity Analysis 

Combining Lemmas 1 to 5, the value of $DIST(1, 1, 0)$ can be computed
recursively, as shown in the algorithm of Figure 1. We analyze the formal complexity
of this algorithm in Theorem 1, in the finest model that we can define, taking
into account the relation for each symbol $\alpha \in \Sigma$ between the number $n_\alpha$ of
occurrences of $\alpha$ in $S$ and the number $m_\alpha$ of occurrences of $\alpha$ in $L$.

Theorem 1. Given two strings $S$ and $L$ over the alphabet $\Sigma$, for each symbol
$\alpha \in \Sigma$, note $n_\alpha$ the number of occurrences of $\alpha$ in $S$ and $m_\alpha$ the number of
occurrences of $\alpha$ in $L$, their sums $n = n_1 + \cdots + n_d$ and $m = m_1 + \cdots + m_d$,
and $g_\alpha = \min\{n_\alpha, m_\alpha - n_\alpha\}$ a measure of how far $n_\alpha$ is from $m_\alpha / 2$. There is an
algorithm computing the Swap-Insert Correction distance $\delta(S, L)$ in time


Algorithm DIST(i, j, c = (c_1, ..., c_d))

 1. if DIST(i, j, c) = +∞ then
 2.     return +∞
 3. else if i = n + 1 then
 4.     (* insertions *)
 5.     return m − j + 1
 6. else if j = m + 1 then
 7.     (* skip all symbols since they were ignored *)
 8.     return 0
 9. else
10.     α ← S[i], β ← L[j]
11.     if c_α > 0 then
12.         (* skip S[i], it was ignored *)
13.         return DIST(i + 1, j, c − w_α)
14.     else if α = β then
15.         (* S[i] and L[j] match *)
16.         return DIST(i + 1, j + 1, c)
17.     else
18.         d_ins ← +∞, d_swaps ← +∞
19.         if c_β = 0 then
20.             (* insert β at index i *)
21.             d_ins ← 1 + DIST(i, j + 1, c)
22.         r ← select(S, rank(S, i, β) + c_β + 1, β)
23.         if r ≠ null then
24.             Δ ← Σ_{θ=1..d} min{c_θ, rank(S, r, θ) − rank(S, i − 1, θ)}
25.             (* swaps *)
26.             d_swaps ← (r − i) − Δ + DIST(i, j + 1, c + w_β)
27.         return min{d_ins, d_swaps}

Fig. 1: Informal algorithm to compute DIST(i, j, c): Lemma 4 and Lemma 5
guarantee the correctness of lines 1 to 8, Lemma 2 guarantees the correctness
of lines 11 to 13, Lemma 1 guarantees the correctness of lines 14 to 16, and
Lemma 3 guarantees the correctness of lines 17 to 27.


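within $O(d + m)$ if $S$ and $L$ have no symbol in common, and otherwise in time
within

$$O\Big( d(n+m) + d^2 n \cdot \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) \Big),$$

where $\Sigma_+ = \{\alpha \in \Sigma : g_\alpha > 0\}$ if $g_\alpha = 0$ for some $\alpha \in \Sigma$, and $\Sigma_+ = \Sigma \setminus
\{\arg\min_{\alpha \in \Sigma} g_\alpha\}$ otherwise.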
Proof. Observe first that there is a reordering of $\Sigma = [1..d]$ such that $0 <
g_1 \le g_2 \le \cdots \le g_s$ and $g_{s+1} = g_{s+2} = \cdots = g_d = 0$ for some index $s \in [0..d]$,
and we assume such an ordering from now on. Note also that given any string
$X \in \{S, L\}$, a simple 2-dimensional array using space within $O(d \cdot |X|)$ can be
computed in time within $O(d \cdot |X|)$, to support the queries $rank(X, i, \alpha)$ and

Algorithm Compute δ(S, L):

1. preprocess each of S and L for rank and select
2. (c_1, c_2, ..., c_d) ← 0
3. return if DIST(1, 1, 0) < +∞ then DIST2(1, 1) else +∞

Fig. 2: Calling algorithm to compute DIST(i, j, 0), filtering degenerate cases before
launching the real computation of the distance.


Algorithm DIST2(i, j):

 1. p ← the first index in [1..d] so that c_p = 0
 2. for α = 1 to d do
 3.     if n_α ≤ m_α − n_α then
 4.         x_α ← c_α
 5.     else
 6.         x_α ← rank(L, j − 1, α) − rank(S, i − 1, α) − c_α
 7. (r_1, ..., r_{d−1}) ← (x_1, ..., x_{p−1}, x_{p+1}, ..., x_d)
 8. k ← j − i − (r_1 + ··· + r_{d−1})
 9. if T[p, i, k, r_1, ..., r_{d−1}] ≠ undefined then
10.     return T[p, i, k, r_1, ..., r_{d−1}]
11. else
12.     if i = n + 1 then
13.         T[p, i, k, r_1, ..., r_{d−1}] ← m − j + 1
14.     else if j = m + 1 then
15.         T[p, i, k, r_1, ..., r_{d−1}] ← 0
16.     else
17.         α ← S[i], β ← L[j]
18.         if c_α > 0 then
19.             c_α ← c_α − 1
20.             T[p, i, k, r_1, ..., r_{d−1}] ← DIST2(i + 1, j)
21.             c_α ← c_α + 1
22.         else if α = β then
23.             T[p, i, k, r_1, ..., r_{d−1}] ← DIST2(i + 1, j + 1)
24.         else
25.             d_ins ← +∞, d_swaps ← +∞
26.             if c_β = 0 and count(S, i, β) < count(L, j, β) then
27.                 d_ins ← 1 + DIST2(i, j + 1)
28.             r ← select(S, rank(S, i, β) + c_β + 1, β)
29.             if r ≠ null then
30.                 Δ ← Σ_{θ=1..d} min{c_θ, rank(S, r, θ) − rank(S, i − 1, θ)}
31.                 c_β ← c_β + 1
32.                 d_swaps ← (r − i) − Δ + DIST2(i, j + 1)
33.                 c_β ← c_β − 1
34.             T[p, i, k, r_1, ..., r_{d−1}] ← min{d_ins, d_swaps}
35.     return T[p, i, k, r_1, ..., r_{d−1}]

Fig. 3: Formal algorithm to compute DIST(i, j, 0), using dynamic programming with
memoization. Note that line 3 of algorithm Compute and line 26 of algorithm
DIST2 guarantee that DIST2(i, j) = DIST(i, j, c) < +∞ in every call.


$select(X, k, \alpha)$ in constant time for all values of $i, k \in [1..|X|]$, and
$\alpha \in \Sigma$.

The case where the two strings S and L have no symbol in common is easy: 
the distance is then +oo. The algorithm detects this case by testing if = 0 
for all a G E, in time within 0{d + m). 

Consider the algorithm of Figure 1, and let $i \in [1..n]$, $j \in [1..m]$, and $c =
(c_1, \ldots, c_d)$ be parameters such that $DIST(i, j, c) < +\infty$.

At least one of the $c_1, \ldots, c_d$ is equal to zero: in the first entry $DIST(1, 1, 0)$
all the counters $c_1, c_2, \ldots, c_d$ are equal to zero, and any counter is incremented
only at line 26, in which another counter must be equal to zero because of
lines 11 and 14.

The number of insertions counted in line 21 in previous calls to the function
DIST in the recursion path from $DIST(1, 1, 0)$ to $DIST(i, j, c)$, is equal
to $j - i - (c_1 + \cdots + c_d)$. Let $t_\alpha$ denote the number of such insertions for the
symbol $\alpha \in \Sigma$. Then, we have

$$j = i + (c_1 + \cdots + c_d) + (t_1 + \cdots + t_d),$$

and for all $\alpha \in \Sigma$, $c_\alpha \le n_\alpha$, $t_\alpha \le m_\alpha - n_\alpha$, and

$$c_\alpha + t_\alpha = rank(L, j-1, \alpha) - rank(S, i-1, \alpha).$$
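
The last identity means that the insertion counts $t_\alpha$ need not be stored: they can be recovered from $i$, $j$ and the counters. A minimal Python sketch of this recovery follows; the function name `insertions_so_far` is hypothetical, counters are given as a dictionary, positions are 1-based, and rank is computed naively with `str.count` for brevity.

```python
def insertions_so_far(S, L, i, j, c):
    """Recover t[a] = rank(L, j-1, a) - rank(S, i-1, a) - c[a] for each symbol a."""
    t = {}
    for a in set(S) | set(L):
        in_L_prefix = L[:j - 1].count(a)   # rank(L, j-1, a)
        in_S_prefix = S[:i - 1].count(a)   # rank(S, i-1, a)
        t[a] = in_L_prefix - in_S_prefix - c.get(a, 0)
    return t
```

For example, for S = "ab" and L = "bca", the state reached after swapping the b to the front and inserting a c is $i = 1$, $j = 3$ with $c_b = 1$; the function then reports one insertion of c, and $i$ plus the sums of the counters and of the insertions gives back $j = 3$.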

Using the above observations, we encode all entries $DIST(i, j, c)$, for $i$, $j$ and
$c$ such that $DIST(i, j, c) < +\infty$, into the following table $T$ of $s + 2 \le d + 2$
dimensions. If we have $s = d$, then

$$T[p, i, k, r_1, \ldots, r_{d-1}] = DIST(i, j, c = (c_1, \ldots, c_d)),$$

where

$$c_p = 0,$$

$$(r_1, \ldots, r_{d-1}) = (x_1, \ldots, x_{p-1}, x_{p+1}, \ldots, x_d),$$

$$x_\alpha = \begin{cases} c_\alpha & \text{if } n_\alpha \le m_\alpha - n_\alpha \\ t_\alpha & \text{if } m_\alpha - n_\alpha < n_\alpha \end{cases}$$

for every $\alpha \in \Sigma$, and

$$k = (c_1 + \cdots + c_d) + (t_1 + \cdots + t_d) - (r_1 + \cdots + r_{d-1}).$$

Furthermore, given any combination of values $i, j, c_1, \ldots, c_d$ we can switch to
the values $p, i, k, r_1, \ldots, r_{d-1}$, and vice versa, in time within $O(d)$. Otherwise, if
$s < d$, then

$$T[i, k, r_1, \ldots, r_s] = DIST(i, j, c = (c_1, \ldots, c_d)),$$

where $(r_1, \ldots, r_s) = (x_1, \ldots, x_s)$. Again, given the values $i, j, c_1, \ldots, c_d$ we can
switch to the values $i, k, r_1, \ldots, r_s$, and vice versa, in $O(d)$ time.

Since $p \in [1..d]$, $i \in [1..n+1]$, $k \in [0..\sum_{\alpha=1}^{d} (m_\alpha - g_\alpha)]$, and $r_\alpha \in [0..g_\alpha]$ for
every $\alpha$, the table $T$ can be as large as $d \times (n+1) \times (1 + \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha)) \times (g_2 +
1) \times \cdots \times (g_d + 1)$ if $s = d$, and as large as $(n+1) \times (1 + \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha)) \times (g_1 +
1) \times \cdots \times (g_s + 1)$ if $0 < s < d$. For $s = 0$, no table is needed. The running time of
this new algorithm includes the $O(d(n+m)) = O(dm)$ time for preprocessing each
of $S$ and $L$ for rank and select, and the time to compute $DIST(1, 1, 0)$, which
is within $O(d)$ times $n + m$ plus the number of cells of the table $T$. If $s = d$, the
time to compute $DIST(1, 1, 0)$ is within

$$O\Big( d(n+m) + d^2 n \cdot \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot (g_2 + 1) \cdots (g_d + 1) \Big).$$

Otherwise, if $0 < s < d$, the time to compute $DIST(1, 1, 0)$ is within

$$O\Big( d(n+m) + d n \cdot \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot (g_1 + 1) \cdots (g_s + 1) \Big).$$

The result follows by noting that: if $s = d$, then $\Sigma_+ = \{2, \ldots, d\}$. Otherwise, if
$s < d$, then $\Sigma_+ = \{1, \ldots, s\}$. □

The result above, about the complexity in the worst case over instances with
$d, n_1, \ldots, n_d, m_1, \ldots, m_d$ fixed, implies results in less precise models, such as in
the worst case over instances for $d$, $n$, $m$ fixed:




Corollary 1. Given two strings $S$ and $L$ over the alphabet $\Sigma$, of respective
sizes $n$ and $m$, the algorithm analyzed in Theorem 1 computes the Swap-Insert
Correction distance $\delta(S, L)$ in time within

$$O\Big( d(n+m) + d^2 n (m-n) \cdot \Big( \frac{n}{d-1} + 1 \Big)^{d-1} \Big),$$

which is within $O(n + m + n^d (m-n))$ for alphabets of fixed size $d$; and within

$$O\Big( d(n+m) + d^2 n^2 \cdot \Big( \frac{m-n}{d-1} + 1 \Big)^{d-1} \Big),$$

which is within $O(n + m + n^2 (m-n)^{d-1})$ for alphabets of fixed size $d$.

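Proof. We use the following claim: If $a \ge 1$ and $x \le y$, then $(a + y)(x + 1) \le
(a + x)(y + 1)$. It can be proved as follows:

$$\begin{aligned}
(a - 1)x &\le (a - 1)y \\
ax + y &\le ay + x \\
ax + y + a + xy &\le ay + x + a + xy \\
(a + y)(x + 1) &\le (a + x)(y + 1).
\end{aligned}$$

Let $\Sigma_+ \subseteq \Sigma$ be as defined in Theorem 1, and consider the worst scenario for the
running time, that is, let us consider w.l.o.g. that $\Sigma_+ = [2..d]$. Let $\beta \in \Sigma_+$ be a
symbol such that $m_\beta - n_\beta < n_\beta$, and define $a$ and $b$ such that:

$$a = \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) - (m_\beta - g_\beta) = \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) - n_\beta = \sum_{\alpha \in \Sigma \setminus \{\beta\}} (m_\alpha - g_\alpha) \ge 1,$$

and

$$b = \prod_{\alpha \in \Sigma_+ \setminus \{\beta\}} (g_\alpha + 1).$$

Note that:

$$\sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) = (a + n_\beta) \cdot b \cdot (m_\beta - n_\beta + 1) \le (a + m_\beta - n_\beta) \cdot b \cdot (n_\beta + 1),$$

which immediately implies

$$\sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) \le (m - n) \cdot \prod_{\alpha \in \Sigma_+} (n_\alpha + 1).$$

Then,

$$\begin{aligned}
O\Big( d^2 n \cdot \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) \Big)
&\subseteq O\big( d^2 n (m-n) \cdot (n_2 + 1) \cdots (n_d + 1) \big) \\
&\subseteq O\Big( d^2 n (m-n) \cdot \Big( \frac{n_2 + \cdots + n_d}{d-1} + 1 \Big)^{d-1} \Big) \\
&\subseteq O\Big( d^2 n (m-n) \cdot \Big( \frac{n}{d-1} + 1 \Big)^{d-1} \Big).
\end{aligned}$$

Using similar arguments, we can prove that

$$\sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) \le n \cdot \prod_{\alpha \in \Sigma_+} (m_\alpha - n_\alpha + 1),$$

which implies the second part of the result.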


4 Discussion 

In 2014, Meister [4] described an algorithm computing the Swap-Insert
Correction distance from a string $S \in [1..d]^n$ to another string $L \in [1..d]^m$ on any
fixed alphabet size $d \ge 2$, in time polynomial in $n$ and $m$. The algorithm that we
described takes advantage of instances where for all symbols $\alpha \in \Sigma$ the number
$n_\alpha$ of occurrences of $\alpha$ in $S$ is either close to zero (i.e. most $\alpha$ symbols from $L$ are
placed in $S$ through insertions) or close to the number $m_\alpha$ of occurrences of $\alpha$
in $L$ (i.e. most $\alpha$ symbols from $L$ are matched to symbols in $S$ through swaps),
while still running in time within $O(m + \min\{n^d (m-n), n^2 (m-n)^{d-1}\})$ when
the alphabet size $d$ is a constant, in the worst case over instances composed of
strings of sizes $n$ and $m$. The exact running time of our algorithm is within

$$O\Big( d(n+m) + d^2 n \cdot \sum_{\alpha=1}^{d} (m_\alpha - g_\alpha) \cdot \prod_{\alpha \in \Sigma_+} (g_\alpha + 1) \Big),$$

where $n_\alpha$ and $m_\alpha$ are the respective numbers of occurrences of symbol $\alpha \in [1..d]$ in
$S$ and $L$; where the vector formed by the values $g_\alpha = \min\{n_\alpha, m_\alpha -
n_\alpha\}$ measures the distance between $(n_1, \ldots, n_d)$ and $(m_1, \ldots, m_d)$; and where
$\Sigma_+ = \{\alpha \in \Sigma : g_\alpha > 0\}$ if $g_\alpha = 0$ for some $\alpha \in \Sigma$, and $\Sigma_+ = \Sigma \setminus \{\arg\min_{\alpha \in \Sigma} g_\alpha\}$
otherwise.

Summarizing the disequilibrium between the frequency distributions of the
symbols in the two strings via the measure $g = \max_{\alpha \in \Sigma} g_\alpha \le n$, this yields a
complexity within $O(d^2 n m g^{d-1})$, which is polynomial in $n$ and $m$, and exponential
only in $d$ of base $g$. Since this disequilibrium $g$ is at most the length $n$
of the smallest string $S$, this implies a worst case complexity within $O(d^2 m n^d)$
over instances formed by strings of lengths $n$ and $m$ over an alphabet of size $d$,
a result matching the state of the art [4] for this problem.


4.1 Implicit Results 

The result from Theorem 1 implies the following additional results:

Weighted Operators: Wagner and Fischer [7] considered variants where the
cost $c_{ins}$ of an insertion and the cost $c_{swap}$ of a swap are distinct. In the
Swap-Insert Correction problem, there are always $m - n$ insertions,
and always $\delta(S, L) - m + n$ swaps, which implies the optimality of the
algorithm we described in such variants.

Computing the Sequence of Corrections: Since any correct algorithm must
verify the correctness of its output, given a set C of correction operators,
any correct algorithm computing the String-to-String Correction
Distance when limited to the operators in C implies an algorithm computing a
minimal sequence of corrections under the same constraints within the same
asymptotic running time.

Implied improvements when only swaps are needed: Abu-Khzam et al. [1]
mention an algorithm computing the Swap String-to-String Correction
distance (i.e. only swaps are allowed) in time within $O(n^2)$. This is a
particular case of the Swap-Insert Correction distance, which happens
exactly when the two strings are of the same size $n = m$ (and no insertion
is either required or allowed). In this particular case, our algorithm yields
a solution running in time within $O(dm)$, hence improving on Abu-Khzam
et al.'s solution [1].

Effective Alphabet: Let $d'$ be the size of the effective alphabet of the instance, i.e. the
number of symbols $\alpha$ of $\Sigma = [1..d]$ such that the number of occurrences of
$\alpha$ in $S$ is a constant fraction of the number of occurrences of $\alpha$ in $L$ (i.e.
$n_\alpha \in \Theta(m_\alpha)$). Our result implies that the real difficulty is $d'$ rather than $d$,
i.e. that even for a large alphabet size $d$ the distance can still be computed
in reasonable time if $d'$ is finite.


4.2 Perspectives 

Those results suggest various directions for future research: 

Further improvements of the algorithm: Our algorithm can be improved
further using a lazy evaluation of the min operator on line 27, so that
the computation in the second branch of the execution stops any time the
computed distance becomes larger than the distance computed in the first
branch. This would save time in practice, but it would not improve the
worst-case complexity in our analysis, in which both branches are fully
explored: one would require a finer measure of difficulty to express how such a
modification could improve the complexity of the algorithm.

Further improvements of the analysis: The complexity of Abu-Khzam et
al.'s algorithm [1], sensitive to the distance from $S$ to $L$, is an orthogonal
result to ours. An algorithm simulating both their algorithm and ours in
parallel yields a solution adaptive to both measures, but an algorithm using
both techniques in synergy would outperform both on some instances, while
never performing worse on other instances.

Adaptivity for other existing distances: Can other String-to-String
Correction distances be computed faster when the numbers of occurrences of
symbols in both strings are similar for most symbols? Edit distances such
as when only insertions or only deletions are allowed are linear anyway, but
more complex combinations require further studies.